THE ESTIMATION OF NON-SAMPLING VARIANCE COMPONENTS IN SAMPLE SURVEYS

H. O. Hartley and J. N. K. Rao

1.  Introduction 

The importance of non-sampling, or measurement, errors has long been
recognized (for the numerous references see e.g. the comprehensive papers by
Hansen, Hurwitz and Bershad (1961) and Bailar and Dalenius (1970)). Briefly, the
various models suggested for such errors assume that a survey record (recorded
content item) differs from its "true value" by a systematic bias, B, and various
additive error contributions associated with various sources of errors such as
interviewers, coders, etc. The important feature of these models is that the
errors made by a specified error source (say a particular interviewer) are
usually "correlated". These correlated errors contribute additive components to
the total mean square error of a survey estimate which do not decrease in
inverse proportion to the overall sample size but only in inverse proportion to
the number of interviewers, coders, etc. Consequently, the application of
standard textbook formulas for the estimation of the variances of survey
estimates may lead to serious underestimates of the real variability, which
should incorporate the non-sampling errors.
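
The effect described above can be illustrated numerically. The following is a
minimal sketch, not part of the paper's method: it assumes a single error
source (interviewers), equal workloads, and independent unit-level errors with
a common variance. The function and parameter names (`sigma2_b`, `sigma2_e`)
are hypothetical.

```python
def variance_of_mean(n_units, n_interviewers, sigma2_b, sigma2_e):
    """Variance of a sample mean when each interviewer adds a common error.

    Assumes equal workloads (n_units divisible by n_interviewers), one common
    error per interviewer with variance sigma2_b, and independent unit-level
    errors with variance sigma2_e.
    """
    if n_units % n_interviewers != 0:
        raise ValueError("equal workloads require n_units % n_interviewers == 0")
    correlated_part = sigma2_b / n_interviewers  # shrinks with interviewers only
    independent_part = sigma2_e / n_units        # shrinks with the sample size
    return correlated_part + independent_part


def naive_variance_of_mean(n_units, sigma2_b, sigma2_e):
    """Variance obtained when all errors are treated as independent."""
    return (sigma2_b + sigma2_e) / n_units
```

With 1000 units, 10 interviewers, `sigma2_b = 1.0` and `sigma2_e = 9.0`, the
first function gives about 0.109 while the second gives 0.01, so the
interviewer term dominates even though it is the smaller variance.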

Attempts have, therefore, been made to estimate the components due to
non-sampling errors. The early work in this area concentrated on surveys
specifically designed to incorporate features facilitating the estimation of
non-sampling components, such as reinterviews and/or interpenetrating samples
(see e.g. Sukhatme and Seth (1952)). However, the more recent literature (see
e.g. Cochran (1968), Fellegi (1969), Nisselson and Bailar (1976), Battese, Fuller
and Hickman (1976)) has also treated surveys in which such features are either
lacking or limited, but these results are restricted to simple surveys permitting
the use of analysis of variance techniques.

In this paper we provide a general methodology applicable to essentially
any multistage survey in which the last stage units are drawn with equal
probabilities. Specifically, our formulas for the estimated variances of target
parameter estimates will include all finite population corrections except those
in the last stage, which are usually negligible. We utilize recent results in the
estimation of components of variance in mixed linear models to achieve these
results and are able to address the problem of estimability of variance
components.

2.  The  assumptions  made. 

In this paper we confine ourselves to what may be regarded as a special case
of a more general model which we hope to cover in a subsequent paper. Here we
assume that:

(2.1) The survey has a stratified multistage design in which the last
stage units are drawn with equal probabilities, while any equal-
or unequal-probability design may be specified for the remaining
stages.

(2.2) Error sources (such as interviewers, coders, etc.) contribute
additive errors to the so-called "content items" associated with
the last stage units.

(2.3) All "correlations" between the errors contributed by a particular
(say the i-th) error source are generated through an "additive
model". That is, the errors have the structure $b_i + \delta b_{is}$, where
$b_i$ is an error contribution from the i-th source common to all units
affected by the i-th source (all units interviewed by the i-th
interviewer), while $\delta b_{is}$, sometimes referred to as an "elementary
non-sampling error", varies randomly from unit to unit (s).

(2.4) The present paper is confined to the case where there is no
systematic bias from any of the error sources.

We  should  state  here  that  the  above  assumptions  (2.2)  and  (2.3)  are  quite 
customary  in  the  literature  on  non-sampling  errors  (see  e.g.,  Sukhatme  & Seth 
(1952)  and  Bailar  & Dalenius  (1970)). 

Although a bias term is usually included in the formulas occurring in the
literature, it can only be evaluated in special cases. For example, it may be
estimated from "special record checks." We do not discuss biases in this paper.

3.  The  model  formulation. 

To fix the ideas expressed in Section 2 we confine ourselves, without loss of
generality, to two types of error sources described as "interviewers" and
"coders". Generalizations to more than two types of error sources do not
present any difficulties. Moreover, to simplify the notation, we introduce the
two-index label (p, s), where the index s labels the s-th elementary unit
(briefly referred to as "secondary") and the index p (briefly called the primary
index) is a composite label indexing the last but one stage unit within the next
higher stage unit ... within the primary unit within a stratum. Thus, for
example, in a three-stage stratified design s will denote the tertiary unit and
p will be a composite index for a "secondary within a primary within a stratum."

We may now write the model in the form

$$y_{ps} = \eta_{ps} + b_i + c_c + \delta b_{ps} + \delta c_{ps} \qquad (1)$$

where

$y_{ps}$ = content item recorded for elementary unit labeled (p, s),

$\eta_{ps}$ = true content item for elementary unit labeled (p, s),

$b_i$ = error variable contributed by the i-th interviewer, common to all
(p, s) interviewed by the i-th interviewer,

$c_c$ = error variable contributed by the c-th coder, common to all (p, s)
coded by the c-th coder,

$\delta b_{ps}$ = elementary interviewer error afflicting the content item of
unit (p, s),

$\delta c_{ps}$ = elementary coder error afflicting the content item of unit
(p, s).

We assume that the $b_i$ and $c_c$ are respectively random samples from infinite
populations of interviewer and coder errors with

$$E(b_i) = 0 \ \text{and} \ \operatorname{Var}(b_i) = \sigma_b^2, \qquad E(c_c) = 0 \ \text{and} \ \operatorname{Var}(c_c) = \sigma_c^2. \qquad (2)$$

The assumptions $E(b_i) = E(c_c) = 0$ postulate the absence of systematic
interviewer and coder biases.

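Likewise we assume that

$$E(\delta b_{ps}) = 0, \quad \operatorname{Var}(\delta b_{ps}) = \sigma_{\delta b}^2, \qquad E(\delta c_{ps}) = 0, \quad \operatorname{Var}(\delta c_{ps}) = \sigma_{\delta c}^2. \qquad (3)$$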


The common interviewer errors $b_i$ and common coder errors $c_c$ are assumed to
be independent from one another and independent of the true content items
$\eta_{ps}$ and the elementary errors $\delta b_{ps}$, $\delta c_{ps}$. However,
$\eta_{ps}$ and $\delta b_{ps}$, $\delta c_{ps}$ are not assumed to be independent.

It should also be noted that $\sigma_{\delta b}^2$ and $\sigma_{\delta c}^2$ apply,
respectively, to the elementary errors of all interviewers and all coders. This
means that we do not, in this paper, allow for the possibility of heterogeneity
of the interviewers' and/or coders' elementary error variances.
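
The error structure of model (1) can be made concrete with a small simulation.
This is only an illustrative sketch: the normal distribution, the dictionary
layout keyed by the unit label (p, s), and the function name
`simulate_records` are assumptions introduced here, since the model itself
fixes only the zero means and the variances of the error terms.

```python
import random


def simulate_records(eta, interviewer_of, coder_of,
                     sd_b, sd_c, sd_db, sd_dc, seed=0):
    """Generate records y[(p, s)] = eta + b_i + c_c + delta_b + delta_c.

    eta, interviewer_of and coder_of are dicts keyed by the unit label (p, s).
    Normal errors are an illustrative assumption; the model only fixes the
    means (zero) and variances of the error terms.
    """
    rng = random.Random(seed)
    # One common error per interviewer and per coder (zero mean, no bias).
    b = {i: rng.gauss(0.0, sd_b) for i in sorted(set(interviewer_of.values()))}
    c = {k: rng.gauss(0.0, sd_c) for k in sorted(set(coder_of.values()))}
    y = {}
    for unit in sorted(eta):
        delta_b = rng.gauss(0.0, sd_db)  # elementary interviewer error
        delta_c = rng.gauss(0.0, sd_dc)  # elementary coder error
        y[unit] = (eta[unit] + b[interviewer_of[unit]]
                   + c[coder_of[unit]] + delta_b + delta_c)
    return y, b, c
```

Setting only `sd_b` above zero shows the "correlated" feature directly: every
unit handled by the same interviewer is shifted by the same amount.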

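We may rewrite the model (1) in the form

$$y_{ps} = \eta_{p.} + b_i + c_c + e_{ps}$$

where

$$e_{ps} = (\eta_{ps} - \eta_{p.}) + \delta b_{ps} + \delta c_{ps} \qquad (4)$$

and where

$$\eta_{p.} = \frac{1}{M_p} \sum_{s=1}^{M_p} \eta_{ps} \qquad (5)$$

is the mean of the $\eta_{ps}$ over the $M_p$ elementary units in the p-th primary.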

The essential concept in our approach is that we shall only estimate the
$\sigma_{ep}^2 = \operatorname{Var}(e_{ps})$ for each primary, p, but do not
obtain separate estimates for the $\operatorname{Var}(\eta_{ps} - \eta_{p.})$
(the variances of the true sampling errors) or the
$\operatorname{Var}\,\delta b_{ps}$, $\operatorname{Var}\,\delta c_{ps}$ (that is,
the elementary non-sampling variances). To justify this strategy we shall show
that the variance of the estimates of population totals and other target
parameters in our finite population likewise only involves the $\sigma_{ep}^2$
and not its separate components.

4.  The  complete  specification  of  the  survey  design. 

As stated in (2.1) above we permit any specification of a stratified
multistage design in which the last stage units (the "secondary units" indexed
(s)) are drawn with equal probability. This means

(4.1) that the design specifies in advance for any set (p) of sampled
primaries the number, $m_p$, of secondary units to be drawn with
equal probability from the $M_p$ units in the population.

Moreover we shall assume for any set of sampled (p)

(4.2) that the design specifies the number of interviewers (I) and
number of coders (C), which will be labeled i = 1, ..., I;
c = 1, ..., C, and

(4.3) that the design specifies the "work-load assignment", i.e., that
it specifies in advance the number of secondary units to be
interviewed by interviewer i in each primary p and likewise the
number to be coded by coder c in each primary p.

Specification (4.1) is quite customary. Specifications (4.2) and (4.3)
are only conceptual since in actual practice I and C and the work-load
assignment will often not be decided on until after the primary sample (p) has
been drawn.

In what follows we shall further assume, for the sake of simplifying the
argument, that the last stage (secondary) sampling fractions $m_p/M_p$ are all
negligibly small so that the sampled $\eta_{ps} - \eta_{p.}$ can be regarded as a
random sample of $m_p$ from an infinite population with mean 0. The inclusion of
the finite population corrections will be discussed in the second paper. We do
not assume, however, that the elementary interviewer and/or coder errors
$\delta b_{ps}$ and/or $\delta c_{ps}$ are necessarily independent of the sampling
errors $\eta_{ps} - \eta_{p.}$, since we shall, in the next section, estimate the
variances of the composite error $e_{ps}$ directly.

5. The conditional estimation of variance components.


Consider a given sample of primaries (p) drawn in accordance with the design.
Then under the assumptions made in Section 4 and conditionally on (p) the model
(4) will represent a "mixed analysis of variance model" where the $b_i$, $c_c$
and $e_{ps}$ are random variables with "variance components" $\sigma_b^2$ (for
interviewers), $\sigma_c^2$ (for coders) and $\sigma_{ep}^2$ (for "elementary
errors") in primary p. The model also involves "fixed constants" $\eta_{p.}$.

In order to relate the model to the notation customary in variance component
estimation methodology we write it in the form

$$y = X\alpha + \sum_{j=1}^{n+2} U_j b_j \qquad (6)$$

where

$y$ is the vector of recorded $y_{ps}$ with number of elements
$m = \sum_{p=1}^{n} m_p$,

$\alpha$ is the n-vector with elements $\eta_{p.}$, the population means for
the sampled primaries,   (7)

$X$ is an associated $m \times n$ design matrix with 1's in the column
corresponding to the primary p of $y_{ps}$,

$b_1$ = I-vector of interviewer variables $b_i$,
$b_2$ = C-vector of coder variables $c_c$,
$b_3$ to $b_{n+2}$ = $m_p$-vectors of $e_{ps}$ for p = 1, ..., n,

$U_j$ = associated design matrices with 1's in those columns that
correspond to the interviewer, coder or primary of the unit
labeled (p, s).
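
As a concrete illustration of the matrix form, the sketch below assembles the
indicator matrices for a small sample. The function name
`build_design_matrices`, the list-of-lists representation, and the ordering of
the blocks (interviewers first, coders second, then one block per sampled
primary) are assumptions made for this example.

```python
def build_design_matrices(units, interviewer_of, coder_of):
    """Return (X, U) for y = X a + sum_j U_j b_j on a sampled unit list.

    units is an ordered list of labels (p, s). X has one column per sampled
    primary. U[0] has one column per interviewer, U[1] one column per coder,
    and U[2:] hold one block per primary with one column per unit in it.
    """
    primaries = sorted({p for p, _ in units})
    interviewers = sorted(set(interviewer_of.values()))
    coders = sorted(set(coder_of.values()))

    def indicator(column_of, columns):
        # One row per unit, with a single 1 in the column the unit maps to.
        return [[1 if column_of(u) == col else 0 for col in columns]
                for u in units]

    X = indicator(lambda u: u[0], primaries)
    U = [indicator(lambda u: interviewer_of[u], interviewers),
         indicator(lambda u: coder_of[u], coders)]
    for p in primaries:
        members = [u for u in units if u[0] == p]
        # Elementary errors e_ps of primary p: rows outside p are all zero.
        U.append([[1 if u == v else 0 for v in members] for u in units])
    return X, U
```

For n sampled primaries the returned list `U` has n + 2 entries, matching the
number of random vectors in the model.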
There is a considerable literature on "component of variance estimation"
in the unbalanced mixed ANOVA model (for a comprehensive bibliography, see e.g.
Searle (1971)). For a computationally simple method of computing estimates of
the $\sigma_j^2$ we refer to the "synthesis based method" by Hartley, Rao and
LaMotte (1977), which is a MINQUE estimate using a particular norm, enjoys
additional optimality properties, and provides conditions for estimability as
follows: Introducing the matrices $V_j = U_j - XX'U_j$, Hartley, Rao and LaMotte
show that the $\sigma_j^2$ are estimable if the $V_j V_j'$ are not linearly
dependent, and this condition is usually satisfied by survey designs. In any case
the condition can be tested on the computer in advance of the field work, and if
the $V_j V_j'$ are found to be dependent this can usually be remedied by
alteration in the work load assignment to interviewers and/or coders.
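
A check of this kind might be sketched as follows. This is not the
synthesis-based estimator itself, only a test of linear independence, and it
rests on an assumed reading of $V_j$ as the residual of $U_j$ after
least-squares projection on the columns of X (for an indicator matrix X this is
within-primary centering). All function names are hypothetical.

```python
from fractions import Fraction


def residualize(U_j, primary_of_row):
    """Subtract within-primary column means from U_j.

    For an indicator matrix X of primaries this equals U_j minus its
    least-squares projection on the columns of X (an assumed reading of V_j).
    """
    rows_by_primary = {}
    for r, p in enumerate(primary_of_row):
        rows_by_primary.setdefault(p, []).append(r)
    V = [[Fraction(x) for x in row] for row in U_j]
    for rows in rows_by_primary.values():
        for col in range(len(U_j[0])):
            mean = sum(V[r][col] for r in rows) / len(rows)
            for r in rows:
                V[r][col] -= mean
    return V


def gram_vector(V):
    """Flatten V V' into one long vector."""
    return [sum(a * b for a, b in zip(r1, r2)) for r1 in V for r2 in V]


def rank(vectors):
    """Rank by exact Gaussian elimination over the rationals."""
    rows = [list(v) for v in vectors]
    rank_found = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(rank_found, len(rows))
                      if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank_found], rows[pivot] = rows[pivot], rows[rank_found]
        for r in range(rank_found + 1, len(rows)):
            factor = rows[r][col] / rows[rank_found][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[rank_found])]
        rank_found += 1
        if rank_found == len(rows):
            break
    return rank_found


def components_estimable(U, primary_of_row):
    """True when the matrices V_j V_j' are linearly independent."""
    grams = [gram_vector(residualize(U_j, primary_of_row)) for U_j in U]
    return rank(grams) == len(grams)
```

Under this reading, a work-load assignment in which each interviewer works in
exactly one primary and covers all of it fails the check, because the
interviewer indicator is then removed entirely by the centering. Spreading each
interviewer over several primaries restores independence.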

Because of the assumptions made in Section 3, the estimates $\hat\sigma_j^2$ of
the variance components $\sigma_j^2$, that is, of $\sigma_b^2$, $\sigma_c^2$ and
$\sigma_{ep}^2$, computed from the sample of $y_{ps}$ conditional on a given set
of primaries (p) are universally unbiassed estimates of these variance
components and will be available for estimates of variances of target
estimators computed directly from the survey data.

6.  Linear  estimates  of  target  parameters  and  their  variances. 

The majority of estimators of target parameters (including the population
total and means) which are computed from the survey sample data are linear
functions of the $y_{ps}$. Since sampling within primaries is with equal
probabilities we confine ourselves to estimators of the form
where $\eta$ is the n-vector of true primary means, $E_{|p}$ is the conditional
expectation given a set (p) of sampled primaries, and $E_p$ is the expectation
over the finite population survey design of primaries. If so-called "unbiased
estimators" $c'(p)\eta$ of target parameters have been chosen then clearly from
(11) $c'(p)\bar{y}$ will be unbiassed.

We now turn to the variance formulas. We have

$$\operatorname{Var}\, c'(p)\bar{y} = \operatorname{Var}_p E_{|p}\, c'(p)\bar{y} + E_p \operatorname{Var}_{|p}\, c'(p)\bar{y} \qquad (12)$$

where $\operatorname{Var}_{|p}$ is a conditional variance given a set of
primaries (p) while $\operatorname{Var}_p$ is the variance for the finite
population survey design of primary units.

Turning first to the second term in (12) (the "within primary component")
we have

$$\operatorname{Var}_{|p}\, c'(p)\bar{y} = c'(p)\, S\, c(p)$$

where the n x n matrix S is the conditional covariance matrix of the
$\bar{y}_p$ whose (p, p') element is given by

$$S_{pp'} = \sum_j \sigma_j^2 \sum_t \frac{\nu(p, t; j)\, \nu(p', t; j)}{m_p\, m_{p'}} \qquad (14)$$

Here $\nu(p, t; j)$ is the number of 1 elements in the t-th column of $U_j$ which
are in rows corresponding to units (p, s) for the argument primary p of
$\nu(p, t; j)$.
The $\nu(p, t; j)$ are parameters which are predetermined through the design and
work allocation for any primary sample (p) because of (4.1) to (4.3). An
unbiassed estimate of $E_p \operatorname{Var}_{|p}\, c'(p)\bar{y}$ is therefore
given by

$$\text{var}_w = \sum_j \hat\sigma_j^2 \sum_t \Big[ \sum_p \frac{c_p\, \nu(p, t; j)}{m_p} \Big]^2 \qquad (15)$$

where $c_p$ is the element of $c(p)$ for primary p and the $\hat\sigma_j^2$ are
the component of variance estimates whose computation is described in
Section 5.
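
The within-primary estimate can be evaluated directly from the work-load
counts. The sketch below takes the estimate to be
$\sum_j \hat\sigma_j^2 \sum_t (\sum_p c_p \nu(p,t;j)/m_p)^2$, which is what
$c'(p) S c(p)$ reduces to under model (4) when each primary mean is the
unweighted mean of its sampled records. The nested list and dictionary layout
and the name `within_variance` are assumptions made for illustration.

```python
def within_variance(c, m, nu, sigma2_hat):
    """Evaluate sum_j sigma2_hat[j] * sum_t (sum_p c[p] * nu[j][t][p] / m[p])**2.

    c: dict primary -> coefficient of that primary's sample mean.
    m: dict primary -> number of sampled secondary units.
    nu: list over components j of lists over columns t of dicts
        primary -> number of units of that primary falling in column t.
    sigma2_hat: list of estimated variance components, one per j.
    """
    total = 0.0
    for counts_by_column, s2 in zip(nu, sigma2_hat):
        for counts in counts_by_column:
            weighted = sum(c[p] * k / m[p] for p, k in counts.items())
            total += s2 * weighted ** 2
    return total
```

For example, with two primaries of two units each, two interviewers who each
take one unit per primary, and a single coder, the interviewer component
enters with weight 1/2 and the coder component with weight 1 when the target
is the mean of the two primary means.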

Turning next to the "between primary variance component" in (12) we have

$$\operatorname{Var}_p E_{|p}\, c'(p)\bar{y} = \operatorname{Var}_p\, c'(p)\eta \qquad (16)$$

Now finite population sampling theory for the primary units (p) regarded as
units will provide a "variance formula" for the "estimator" $c'(p)\eta$ in the
form

$$\operatorname{Var}_p\, c'(p)\eta = V(\eta) \qquad (17)$$

(where $\eta$ is the N-vector of primary means in the finite population of N
primary means) and also provide an unbiassed estimate of $V(\eta)$ in the form

$$v(c'(p)\eta) = \eta' A \eta, \qquad E_p\, v(c'(p)\eta) = V(\eta).$$

In the above example of two stage equal probability sampling without
replacement we have

$$V(\eta) = \frac{N^2}{n} \Big(1 - \frac{n}{N}\Big) \sum_{p=1}^{N} \frac{(M_p \eta_{p.} - \overline{M\eta}_N)^2}{N - 1} \qquad (19)$$

and

$$v(c'(p)\eta) = \frac{N^2}{n} \Big(1 - \frac{n}{N}\Big) \sum_{p=1}^{n} \frac{(M_p \eta_{p.} - \overline{M\eta}_n)^2}{n - 1} \qquad (20)$$

where $\overline{M\eta}_n$ and $\overline{M\eta}_N$ are respectively the sample
and population means of the $M_p \eta_{p.}$. Equations (19) and (20) are the
well-known formulas for the variance and variance estimate of the
$N\, \overline{M\eta}_n$ in a simple random sample of n units with
characteristics $M_p \eta_{p.}$ drawn from a finite population of N units.
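
The textbook variance estimate for an expanded sample mean under simple random
sampling without replacement can be computed as in the sketch below. It treats
the sampled characteristics as known numbers, which is an idealization here
since the $M_p \eta_{p.}$ are not observed directly. The function name is
hypothetical.

```python
def srs_total_variance_estimate(values, population_size):
    """Estimate the variance of N * (sample mean) under simple random sampling
    without replacement, using the standard textbook formula.

    values: the n sampled characteristics (here the M_p * eta_p. of the text).
    population_size: N, the number of units in the finite population.
    """
    n = len(values)
    if n < 2 or n > population_size:
        raise ValueError("need 2 <= n <= N")
    mean = sum(values) / n
    sum_sq = sum((v - mean) ** 2 for v in values)
    fpc = 1.0 - n / population_size  # finite population correction
    return population_size ** 2 / n * fpc * sum_sq / (n - 1)
```

When every unit of the population is sampled (n = N) the finite population
correction is zero and so is the estimate.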


Returning to the general case (12), an unbiassed estimate $\text{var}_B$ of the
between primary component of $\operatorname{Var}(c'(p)\bar{y})$ can be computed
from the $\bar{y}_p$ through

$$\text{var}_B = \bar{y}' A \bar{y} - \operatorname{tr} AS \qquad (21)$$

with S evaluated at the estimates $\hat\sigma_j^2$. The above formula (21)
cannot, of course, claim any particular properties other than unbiassedness.
However, numerical experience indicates that the second term will usually be
negligible compared with the first.
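
The role of the correction term can be seen from the conditional expectation of
the quadratic form. The following is a sketch under stated assumptions: A
depends only on the sampled set (p), $E_{|p}\,\bar{y} = \eta$ with conditional
covariance matrix S, and $\hat{S}$ (a notation introduced here) denotes S with
conditionally unbiassed estimates $\hat\sigma_j^2$ substituted.

```latex
\begin{align*}
E_{|p}\,\bar{y}' A \bar{y}
  &= \eta' A \eta + \operatorname{tr}(A S) \\
E_{|p}\,\operatorname{tr}(A \hat{S})
  &= \operatorname{tr}(A S)
  \quad \text{(since $S$ is linear in the $\sigma_j^2$)} \\
E_{|p}\,\bigl\{\bar{y}' A \bar{y} - \operatorname{tr}(A \hat{S})\bigr\}
  &= \eta' A \eta \\
E_p\, E_{|p}\,\bigl\{\bar{y}' A \bar{y} - \operatorname{tr}(A \hat{S})\bigr\}
  &= E_p\, \eta' A \eta = V(\eta)
\end{align*}
```

The first line is the usual expectation of a quadratic form, so the trace term
removes exactly the part of $\bar{y}' A \bar{y}$ that is due to within-primary
and measurement variation.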

7.  Summary. 

To summarize, we have provided a method of estimating the overall variance
of a linear estimator of the form $c'(p)\bar{y}$ which includes the non-sampling
errors for any stratified multistage design in which the last stage is an equal
probability selection procedure. The estimate of the variance contains two
components, namely a component $\text{var}_w$ given by (15) representing
variation of the last stage units within the last but one stage units plus
elementary measurement errors. The second component $\text{var}_B$ given by (21)
represents a composite of components due to variation of the higher stage units
each within the units of next higher stage. The "within component"
$\text{var}_w$ involves estimated variance components $\hat\sigma_j^2$ computed
by simple mixed model ANOVA techniques. The "between component" $\text{var}_B$
also involves these $\hat\sigma_j^2$ in the correction term
$-\operatorname{tr} AS$ with $S = (S_{pp'})$ given by (14). However, its leading
term $\bar{y}' A \bar{y}$ is a quadratic form in the last but one stage sample
means $\bar{y}_p$ directly provided by standard estimation of variance formulas
in finite population sampling and including all finite population corrections
for the higher stages.

Simple  numerical  examples  will  be  provided  in  our  next  paper. 